A PRISMA systematic review of adolescent gender dysphoria literature: 3) treatment

It is unclear whether the literature on adolescent gender dysphoria (GD) provides evidence to inform clinical decision making adequately. In the final of a series of three papers, we sought to review published evidence systematically regarding the types of treatment being implemented among adolescents with GD, the age when different treatment types are instigated, and any outcomes measured within adolescence. Having searched PROSPERO and the Cochrane library for existing systematic reviews (and finding none at that time), we searched Ovid Medline 1946 –October week 4 2020, Embase 1947–present (updated daily), CINAHL 1983–2020, and PsycInfo 1914–2020. The final search was carried out on 2nd November 2020 using a core strategy including search terms for ‘adolescence’ and ‘gender dysphoria’ which was adapted according to the structure of each database. Papers were excluded if they did not clearly report on clinically-likely gender dysphoria, if they were focused on adult populations, if they did not include original data (epidemiological, clinical, or survey) on adolescents (aged at least 12 and under 18 years), or if they were not peer-reviewed journal publications. From 6202 potentially relevant articles (post deduplication), 19 papers from 6 countries representing between 835 and 1354 participants were included in our final sample. All studies were observational cohort studies, usually using retrospective record review (14); all were published in the previous 11 years (median 2018). There was significant overlap of study samples (accounted for in our quantitative synthesis). All papers were rated by two reviewers using the Crowe Critical Appraisal Tool v1·4 (CCAT). The CCAT quality ratings ranged from 71% to 95%, with a mean of 82%. Puberty suppression (PS) was generally induced with Gonadotropin Releasing Hormone analogues (GnRHa), and at a pooled mean age of 14.5 (±1.0) years. Cross Sex Hormone (CSH) therapy was initiated at a pooled mean of 16.2 (±1.0) years. 
Twenty-five participants from 2 samples were reported to have received surgical intervention (24 mastectomy, one vaginoplasty). Most changes to health parameters were inconclusive, except an observed decrease in bone density z-scores with puberty suppression, which then increased with hormone treatment. There may also be a risk for increased obesity. Some improvements were observed in global functioning and depressive symptoms once treatment was started. The most common side effects observed were acne, fatigue, changes in appetite, headaches, and mood swings. Adolescents presenting for GD intervention were usually offered puberty suppression or cross-sex hormones, but rarely surgical intervention. Reporting centres broadly followed established international guidance regarding age of treatment and treatments used. The evidence base for the outcomes of gender dysphoria treatment in adolescents is lacking. It is impossible from the included data to draw definitive conclusions regarding the safety of treatment. There remain areas of concern, particularly changes to bone density caused by puberty suppression, which may not be fully resolved with hormone treatment.

Introduction
showed in a UK sample that desistance was more likely in those aged 15 years or under than in those aged 16 or over (9.2% vs 4.4%) [19]. However, there is a lack of relevant recent follow-up studies. There is an explicit lack of evidence on adolescent-onset GD, so the interplay between GD and other mental health (MH) factors in this phenomenon is not well understood [20].
Intense international debate regarding a number of issues relating to GD intervention in adolescence is ongoing, especially within Europe and North America, where the main research-active treatment centres are based [20]. One recent high-profile legal case (Bell vs Tavistock [21]) attracted considerable attention from people and organisations with a range of strongly held views both in favour of and against the ruling, illustrating the acknowledged lack of good-quality evidence regarding treatment comorbidities and outcomes to inform service design [22,23]. Concern in the UK led to the commissioning of the Cass review, which in March 2022 recommended that the Tavistock Gender Identity Development Service (GIDS) be closed down in favour of developing regional specialist centres [24] and highlighted the need not only for much better evidence but also for clinicians to be better informed and willing to work toward positive change [25]. Services are sometimes left having to make unilateral decisions against national guidance; for example, the Karolinska University Hospital in Stockholm, Sweden, changed its policy to limit puberty suppression to the context of clinical studies [26]. More recently, a new version of the World Professional Association for Transgender Health (WPATH) standards of care has removed the lower age limit for certain treatments, leaving this decision to clinician judgement [27,28].

Scope of the review
This review is the third in a three-part series addressing the current state of evidence on gender dysphoria experienced in adolescence. Our over-arching aim was to establish what the literature tells us about gender dysphoria in adolescence. We broke this down into seven specific questions (see below). Paper 1 [1] addressed questions 1-3c (italicised), Paper 2 [2] addressed question 4 (plain text), and Paper 3, the current paper, addresses questions 3d and 5-7 (bold text). Our overall research aim was addressed by seven specific questions: uploaded on 2nd February 2021 to include specific detail on age criteria and clinical verification of condition. The review has been prepared according to PRISMA 2020 [29] guidelines (see Table 1 for checklist).

Eligibility criteria
The volume of non-peer-reviewed literature in initial searches proved so great that we took the decision to include only peer-reviewed journal papers featuring original research data. This decision was made subsequent to initial PROSPERO registration, but prior to full text screening. Complete inclusion criteria were:
• Focused on gender dysphoria or transgenderism;
• Includes data on adolescents (aged 12-17 years inclusive);
• Includes original data (not review paper or opinion piece);
• Peer-reviewed publication (not theses or conference proceedings);
• In English language.
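The inclusion criteria above can be read as a conjunction of simple checks. The following is a minimal sketch only: the `Record` fields, and the rule that a sample's reported age range must overlap 12-17 years, are assumptions made for illustration and do not describe the reviewers' actual screening procedure.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical summary of one candidate paper (field names are assumptions)."""
    focuses_on_gd: bool      # gender dysphoria or transgenderism is the focus
    min_age: float           # youngest reported participant age, in years
    max_age: float           # oldest reported participant age, in years
    has_original_data: bool  # not a review or opinion piece
    peer_reviewed: bool      # not a thesis or conference proceeding
    language: str


def meets_inclusion_criteria(r: Record) -> bool:
    # Assumed rule: the reported age range must overlap 12-17 years inclusive.
    includes_adolescents = r.min_age < 18 and r.max_age >= 12
    return (
        r.focuses_on_gd
        and includes_adolescents
        and r.has_original_data
        and r.peer_reviewed
        and r.language == "English"
    )


adolescent_cohort = Record(True, 12.0, 17.9, True, True, "English")
adult_cohort = Record(True, 18.0, 45.0, True, True, "English")
print(meets_inclusion_criteria(adolescent_cohort))  # True
print(meets_inclusion_criteria(adult_cohort))       # False
```

In practice, mixed-age samples needed a judgement call where age ranges were not reported, which a boolean rule of this kind cannot capture.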

Information sources
We searched PROSPERO and the Cochrane library for existing systematic reviews. We searched Ovid Medline (1946 to October week 4 2020), Embase (1947 to present, updated daily), CINAHL (1983 to 2020), and PsycInfo (1914 to 2020). After selecting the final sample of articles, the first author used their reference lists as a secondary data source.

Search
The final search was carried out on 2nd November 2020 using a core strategy which was adapted according to the structure of each database. The core strategy included search terms for 'adolescence' and 'gender dysphoria'. This was kept deliberately broad in order to ensure any studies on the subject could be screened for eligibility. The specific search strategy employed in EMBASE is given below, and represents the format followed with the others. The specific search strategies employed in each database are detailed in Table 2.

Table 1. PRISMA 2020 checklist.

| Section and Topic | Item # | Checklist item | Location where item is reported |
|---|---|---|---|
| Objectives | 4 | Provide an explicit statement of the objective(s) or question(s) the review addresses. | 5 |
| METHODS | | | |
| Eligibility criteria | 5 | Specify the inclusion and exclusion criteria for the review and how studies were grouped for the syntheses. | 6-7 |
| Information sources | 6 | Specify all databases, registers, websites, organisations, reference lists and other sources searched or consulted to identify studies. Specify the date when each source was last searched or consulted. | 6 |
| Search strategy | 7 | Present the full search strategies for all databases, registers and websites, including any filters and limits used. | 6, Table 2 |
| Selection process | 8 | Specify the methods used to decide whether a study met the inclusion criteria of the review, including how many reviewers screened each record and each report retrieved, whether they worked independently, and if applicable, details of automation tools used in the process. | 7-8 |
| Data collection process | 9 | Specify the methods used to collect data from reports, including how many reviewers collected data from each report, whether they worked independently, any processes for obtaining or confirming data from study investigators, and if applicable, details of automation tools used in the process. | 8 |
| Data items | 10a | List and define all outcomes for which data were sought. Specify whether all results that were compatible with each outcome domain in each study were sought (e.g. for all measures, time points, analyses), and if not, the methods used to decide which results to collect. | 8 |
| | 10b | List and define all other variables for which data were sought (e.g. participant and intervention characteristics, funding sources). Describe any assumptions made about any missing or unclear information. | 8 |
| Study risk of bias assessment | 11 | Specify the methods used to assess risk of bias in the included studies, including details of the tool(s) used, how many reviewers assessed each study and whether they worked independently, and if applicable, details of automation tools used in the process. | 8 |
| Effect measures | 12 | Specify for each outcome the effect measure(s) (e.g. risk ratio, mean difference) used in the synthesis or presentation of results. | 8 |
| Synthesis methods | 13a | Describe the processes used to decide which studies were eligible for each synthesis (e.g. tabulating the study intervention characteristics and comparing against the planned groups for each synthesis (item #5)). | 8 |
| | 13b | Describe any methods required to prepare the data for presentation or synthesis, such as handling of missing summary statistics, or data conversions. | 9 |
| | 13c | Describe any methods used to tabulate or visually display results of individual studies and syntheses. | 8-9 |
| | 13d | Describe any methods used to synthesize results and provide a rationale for the choice(s). If meta-analysis was performed, describe the model(s), method(s) to identify the presence and extent of statistical heterogeneity, and software package(s) used. | 8-9 |
| | 13e | Describe any methods used to explore possible causes of heterogeneity among study results (e.g. subgroup analysis, meta-regression). | N/A |
| | 13f | Describe any sensitivity analyses conducted to assess robustness of the synthesized results. | N/A |
| Reporting bias assessment | 14 | Describe any methods used to assess risk of bias due to missing results in a synthesis (arising from reporting biases). | N/A |
| Certainty assessment | 15 | Describe any methods used to assess certainty (or confidence) in the body of evidence for an outcome. | |
| Study selection | 16a | Describe the results of the search and selection process, from the number of records identified in the search to the number of studies included in the review, ideally using a flow diagram. | 7, Fig 1 |
| | 16b | Cite studies that might appear to meet the inclusion criteria, but which were excluded, and explain why they were excluded. | Table 4 |

Study selection
The study selection process is illustrated in Fig 1. We used Endnote v. X7.8 to manage all references, and followed the de-duplication and management strategies set out in Bramer et al. (2016) [30] and Peters (2017) [31] respectively. In the first stage of screening, papers were excluded based on their title or abstract if they did not clearly report on gender dysphoria or transgenderism, or if they were focused on adult populations. In the second stage of screening, papers were excluded on the basis of title and abstract if they did not include original data (epidemiological, clinical, or survey) on adolescents (aged at least 12 and under 18 years). At both stages papers were retained if there was insufficient information to exclude them.
Full-text files were obtained for the remaining records.
Papers were rejected at this stage if they:
• Contained no original data (including literature and clinical reviews, journalistic / editorial pieces, letters and commentaries);
• Included only case studies or selected case series;
• Pertained to conditions other than GD (e.g., disorders of sexual development or HIV);
• Did not include clinically-identified GD (e.g., survey where participants self-identify, with no clinical contact);
• Pertained to populations other than those with GD (e.g., LGBTQ more broadly);
• Pertained to populations including or restricted to those aged 18 years or older. This included papers where adolescents and adults were included in the same sample, but adolescents were not separately reported (in many cases age range was not reported and so a 'balance of probabilities' assessment had to be made based on the reported mean age);
• Pertained to populations restricted to those aged under 12 years of age. This included papers where adolescents and children were included in the same sample, but the majority of participants were clearly under 12 (based on mean or median age);
• Where participants were practitioners, not patients;
• Referred only to conference proceedings;
• Were written in a non-European language (e.g., Turkish);
• Could not be obtained (including due to being published in non-English language journals, or in theses).

Following initial full text screening, all remaining papers were assessed by a second reviewer to reduce the risk of inclusion bias. Where reviewers reached a different conclusion, discussion took place to reach consensus. If agreement could not be reached, a third reviewer was consulted, and discussion used to reach consensus amongst all three reviewers.

Table 1. (Continued)

| Section and Topic | Item # | Checklist item | Location where item is reported |
|---|---|---|---|
| Study characteristics | 17 | Cite each included study and present its characteristics. | Table 3 |
| Risk of bias in studies | 18 | Present assessments of risk of bias for each included study. | Table 9; Fig 2 |
| Results of individual studies | 19 | For all outcomes, present, for each study: (a) summary statistics for each group (where appropriate) and (b) an effect estimate and its precision (e.g. confidence/credible interval), ideally using structured tables or plots. | Tables 5-8 |
| Results of syntheses | 20a | For each synthesis, briefly summarise the characteristics and risk of bias among contributing studies. | 9-10 |
| | 20b | Present results of all statistical syntheses conducted. If meta-analysis was done, present for each the summary estimate and its precision (e.g. confidence/credible interval) and measures of statistical heterogeneity. If comparing groups, describe the direction of the effect. | |
| | 23b | Discuss any limitations of the evidence included in the review. | 17-18 |
| | 23c | Discuss any limitations of the review processes used. | 17-18 |
| | 23d | Discuss implications of the results for practice, policy, and future research. | 18 |
| OTHER INFORMATION | | | |
| Registration and protocol | 24a | Provide registration information for the review, including register name and registration number, or state that the review was not registered. | 6 |
| | 24b | Indicate where the review protocol can be accessed, or state that a protocol was not prepared. | 6 |
| | 24c | Describe and explain any amendments to information provided at registration or in the protocol. | N/A |
| Support | 25 | Describe sources of financial or non-financial support for the review, and the role of the funders or sponsors in the review. | Submission system |
| Competing interests | 26 | Declare any competing interests of review authors. | 19 |
| Availability of data, code and other materials | 27 | Report which of the following are publicly available and where they can be found: template data collection forms; data extracted from included studies; data used for all analyses; analytic code; any other materials used in the review. | Template: 8; Data: Tables 5-8 |
Data extracted from eligible papers were tabulated and used in the qualitative synthesis. Given the limited number of specialist treatment centres globally, we assessed how many of the included papers featured the same or overlapping samples.
Papers included in the sub-sample for the present analysis contained some data on either age at treatment commencement, type of treatment administered, or treatment outcomes.

Quality assessment
All papers were rated by two reviewers using the Crowe Critical Appraisal Tool v1·4 (CCAT [32]). CCAT is suitable for a range of methodological approaches, assessing papers in terms of eight categories: Preliminaries (overall clarity and quality); Introduction; Design; Sampling; Data collection; Ethical matters; Results; Discussion. Each category is rated out of 5 and all eight categories summed to give a total out of 40 (converted to a percentage). In the present review, each paper was then assigned to one of five categories, based on the average rating of the reviewers, where a rating of 0-20% was coded 1 (poorest quality), and 81-100% coded 5 (highest quality). Inter-rater reliability was shown to be very good (κ = 0·93, SE = 0·07).
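The scoring arithmetic described above can be sketched as follows. This is an illustration under stated assumptions: the text gives only the lowest (0-20%) and highest (81-100%) bands, so the three intermediate bands are assumed here to be equal 20-point steps, and the function names and reviewer scores are hypothetical.

```python
def ccat_percentage(category_scores):
    """Convert eight CCAT category scores (each 0-5) to a percentage of 40."""
    if len(category_scores) != 8:
        raise ValueError("CCAT has eight categories")
    if any(not 0 <= s <= 5 for s in category_scores):
        raise ValueError("each category is rated from 0 to 5")
    return sum(category_scores) * 100 / 40


def quality_band(percent):
    """Map a percentage to a 1-5 quality code.

    Bands 2-4 (up to 40%, 60%, 80%) are an assumption; only bands 1 and 5
    are stated in the text.
    """
    for band, upper in enumerate((20, 40, 60, 80), start=1):
        if percent <= upper:
            return band
    return 5


# Hypothetical ratings of one paper by two reviewers.
reviewer_a = [4, 4, 4, 4, 4, 4, 4, 4]   # 32/40 = 80%
reviewer_b = [5, 4, 4, 4, 4, 4, 4, 5]   # 34/40 = 85%
average = (ccat_percentage(reviewer_a) + ccat_percentage(reviewer_b)) / 2
print(average, quality_band(average))    # 82.5 5
```

Banding the average of the two reviewers' percentages, rather than each rating separately, follows the description in the text.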

Data collection process
Data were extracted from the papers using the CCAT form (https://conchra.com.au/wpcontent/uploads/2015/12/CCAT-form-v1.4.pdf) by two reviewers per paper and compiled by the first author (LT). Once compiled, instances of overlap between papers (i.e., if the same sample was described in two papers) were identified and tabulated, and the final sample for each question defined.

Number of studies included, retained and excluded
The PRISMA diagram in Fig 1 provides details of the screening and exclusion process. The searches returned 8655 results, reduced to 6202 following de-duplication. Titles and abstracts were screened by one reviewer (LT); 4659 records were excluded after initial screening and a further 699 on second-stage title / abstract screening. This left 553 eligible for full text screening. An initial screening (LT) of full texts reduced the number of records to 155. Forty-seven papers were included in the final dataset, of which 19 included data for the present paper. Full characteristics of included studies are provided in Table 3.

Study characteristics
All of the included studies originated from a small number of centres in wealthy nations: USA (n = 6), Netherlands (n = 5), UK (n = 3), Belgium (n = 3), Germany (n = 1), and Israel (n = 1). The Belgian studies came from the same centre, an adolescent gender clinic in Ghent. The Dutch studies were from the same centre in Amsterdam. The UK studies consisted of samples assessed at the Gender Identity Development Service in London. As such, there may be overlap in the samples studied, but this was not always clearly reported. Both Klaver et al. papers [33,34] examine the same cohort but investigate different parameters. Partial overlap is reported in the Tack et al. papers [36,37]; however, the degree of overlap is not fully described. Overlap was therefore estimated using dates of record search or dates of inclusion. Fig 2 provides a graphic representation of likely sample overlap. All papers included were published within the last eleven years, with the earliest having been published in 2011 (de Vries et al. [38]). Reported dates of treatment ranged from 1998 to 2018 inclusive. The majority (n = 13) of studies were retrospective in nature, and predominantly took their data from review of medical records. None of the studies included in this paper were randomised controlled trials (RCTs). Most papers (n = 14) contained both NF and NM participants. Three (Perl et al. [39], Stoffers et al. [40], Tack et al., 2016 [36]) contained data pertaining to NF only, and one (Tack et al., 2017 [37]) contained data pertaining to NM only. In total, approximately 1300 adolescents who had been treated for GD were included in this analysis. Of these, around 1102 were treated with puberty suppression (PS), and 727 were treated with CSH, either in addition to PS (n = 506) or as monotherapy (n = 221).
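Estimating likely overlap from dates of record search or inclusion amounts to checking whether two studies from the same centre have intersecting date windows. The sketch below is a hypothetical illustration: the `Study` fields, centre names and dates are invented for the example and are not taken from the included papers, and the review does not specify the exact procedure used.

```python
from dataclasses import dataclass
from datetime import date
from itertools import combinations


@dataclass
class Study:
    label: str
    centre: str
    start: date  # first date of inclusion or record search
    end: date    # last date of inclusion or record search


def overlap_days(a: Study, b: Study) -> int:
    """Number of calendar days (inclusive) shared by two inclusion windows."""
    latest_start = max(a.start, b.start)
    earliest_end = min(a.end, b.end)
    return max(0, (earliest_end - latest_start).days + 1)


def overlapping_pairs(studies):
    """Pairs from the same centre whose inclusion windows intersect."""
    pairs = []
    for a, b in combinations(studies, 2):
        if a.centre == b.centre:
            days = overlap_days(a, b)
            if days > 0:
                pairs.append((a.label, b.label, days))
    return pairs


# Hypothetical studies, for illustration only.
studies = [
    Study("Study A", "Centre X", date(2010, 1, 1), date(2014, 12, 31)),
    Study("Study B", "Centre X", date(2013, 1, 1), date(2016, 12, 31)),
    Study("Study C", "Centre Y", date(2013, 1, 1), date(2016, 12, 31)),
]
pairs = overlapping_pairs(studies)
print(pairs)  # [('Study A', 'Study B', 730)]
```

Intersecting windows indicate only that samples could overlap; they do not show how many participants were shared.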
All nineteen papers included data on the treatments offered for adolescent GD. All nineteen included data on PS, of which six [35,41-45] focused on this exclusively. Of those focusing exclusively on PS, four [42-45] analysed GnRHa use, one (Tack et al. 2018) [35] analysed the effects of progestins exclusively, and one (Lee et al.) only contained treatment used and age of treatment data [41]. Thirteen papers looked at the effects of both PS and CSH. Of these, nine looked at the effects of GnRHa and CSH, two looked at the effects of progestins and CSH [36,37] and one (Chen et al.) [46] looked at the effects of GnRHa, progestins, androgen-receptor blockers and CSH. Eight analysed both oestrogen and testosterone therapies, and two [39,40] analysed testosterone solely.
Patients in most papers (n = 15) had a GD diagnosis according to either the DSM-IV or V definition of Gender Dysphoria. Two papers [42,47] used ICD-9/10 definitions. Perl et al. [39] did not describe how patients were diagnosed with GD, but did state the participants had sought out and had been treated for GD at a Paediatric Gender Dysphoria Clinic. Likewise Jensen et al. [48] used data from adolescent participants who had received or were receiving CSH therapy at a paediatric gender clinic for GD.
A substantial group of papers narrowly missed inclusion criteria, mostly on the age criterion and some on the clinically likely GD criterion, and were not included in the final sample of reviewed papers. We documented characteristics of all studies excluded at the final full text screen in Table 4.

Overall findings based on included studies
What is the pattern of age at treatment? What treatments have been used to address GD in adolescence? The age of initiation of PS was explicitly stated in 17 papers. After accounting for sample overlap and including only those papers reporting mean and SD, the final age calculation drew on 11 papers. The pooled mean age at which PS was started was 14.5 (±1.0) years, with the lowest treatment age being 8.8 years [42]. Only five papers were included in the calculation of mean age of initiation of CSH therapy (16.2±1.0 years). The lowest reported age of CSH initiation was 13.2 years [49]. This mean included those who initially had PS monotherapy before adding CSH therapy, and those who were started on CSH monotherapy. Age data used in this analysis are presented in Tables 5 and 6.
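One common way to pool means and SDs reported by separate studies is a sample-size-weighted mean with a pooled within-study SD. The sketch below shows that calculation; it is an assumption that this resembles the approach used here, since the pooling formula is not stated in the text, and the input values are hypothetical.

```python
import math


def pooled_mean_sd(studies):
    """Pool (n, mean, sd) triples.

    Mean: weighted by sample size.
    SD:   pooled within-study SD, sqrt(sum((n_i - 1) * sd_i^2) / sum(n_i - 1)).
    This ignores between-study differences in means.
    """
    total_n = sum(n for n, _, _ in studies)
    mean = sum(n * m for n, m, _ in studies) / total_n
    dof = sum(n - 1 for n, _, _ in studies)
    variance = sum((n - 1) * sd ** 2 for n, _, sd in studies) / dof
    return mean, math.sqrt(variance)


# Hypothetical inputs: (sample size, mean age in years, SD).
example = [(10, 14.0, 1.0), (30, 15.0, 1.0)]
mean_age, sd_age = pooled_mean_sd(example)
print(round(mean_age, 2), round(sd_age, 2))  # 14.75 1.0
```

Weighting by sample size means larger cohorts dominate the pooled estimate, which matters when a few centres contribute most participants.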
The method of PS was fairly consistent for this age group. Most (15/19) papers described the use of GnRHa. Only three (all Tack et al. [35-37]) did not use GnRHa; however, they did state that common practice in their centre is to prescribe GnRHa to patients who present at a less advanced stage of puberty. Kuper et al. [49] do not record what method was used for PS. In Chen et al. (2016), only 39.5% (n = 15) received PS treatment. Of those 15, 6 (40%) were denied insurance approval for GnRHa, so second-line alternatives, i.e., progestins and antiandrogens, were prescribed. Of the patients who received PS treatment, only 21% were prescribed CSH at a recommended age (around 16 years old). Most studies with CSH data included both NF and NM participants, so the effects of both oestradiol and testosterone could be reported. One study indicated that, as expected, lower doses of oestradiol and testosterone cypionate were required for individuals who had received GnRHa before starting CSH treatment [48].
2015) indicated that 7.5% of patients were referred to mental health services due to possible psychiatric comorbidities, which did not allow the start of medical intervention, and they received psychotherapy until they were eligible for PS treatment [43]. Becker-Hebly et al. (2020) indicated that the patients who were only undergoing psychosocial intervention (28%) were generally considered for medical treatment, but they needed to address psychological problems first [50].
Gender-affirming surgery (GAS) was not routinely offered in this young sample. Two papers reported cases obtaining GAS, with one reporting 15 NF adolescents obtaining mastectomy surgery, at an average age of 17.2 years old (range 15.2-18.7 years old) [49]. One centre in Germany mentions 11 patients obtaining GAS (14 mastectomy, 1 vaginoplasty) at a mean age of 18.0 years old (16.0-19.6 years old) [50]. Neither paper details how this surgery was obtained or where it was carried out. No other centres reported GAS in their populations.
What outcomes are associated with treatment/s for GD in adolescence? Blood pressure, biochemistry and haematology. A range of cardiovascular and laboratory indicators of health were measured within the included studies. This was usually done on an exploratory basis to establish the need for continued monitoring in future patients. Three papers [34,39,40] analysed treatment effects on blood pressure, and four papers [34,36,37,40] analysed effects on lipid profiles and on insulin resistance. Treatment outcomes in this category are presented in Tables 7 and 8.
Blood Pressure. Three papers [34,39,40] analysed the effects of GD treatment on blood pressure (BP), both systolic (SBP) and diastolic (DBP). One paper [34] analysed the effects of both GnRHa and CSH on blood pressure. Two papers [39,40] looked at BP changes with GnRHa and testosterone use, and therefore only included NF.
For GnRHa-induced PS, two [34,39] reported a significant increase in DBP (Klaver et al. [34] found an effect only in NM, with no effect in NF), whereas one (Stoffers et al. [40]) reported no change. There were no significant changes in SBP recorded in any of the papers during GnRHa treatment. There were no cases of hypertension with GnRHa reported in any of the included papers. The results from the papers that analysed the effects of CSH are similarly inconclusive. One paper [39] reported a decrease in DBP with testosterone to pre-treatment levels following a rise during GnRHa treatment. Stoffers et al. [40] reported a significant increase in SBP with testosterone use; however, this followed an observed reduction over the preceding GnRHa treatment period and so in fact represented a return to baseline. Klaver et al. (2020) [34] reported a significant increase in both DBP and SBP after addition of testosterone. Additionally, they reported a significant increase in DBP on the addition of oestrogen. It should be noted that Klaver et al. (2020) [34] measured the endpoint data in their cohort at the age of 22 years, and so theirs is the only source of long-term data present. Overall, BP does not seem to be adversely affected by either GnRHa or CSH treatment. At no point in any of these papers did the BP measurements stray from the normal range.
Lipid Profile. The effects of GD treatment on lipid profile were analysed in four papers; one [34] looked at effects of GnRHa and CSH in both NF and NM, one looked at the effects of GnRHa and testosterone [40], one looked at the effects of lynestrenol (a progestin) and testosterone [36], and one looked at the effects of cyproterone acetate (an antiandrogen and progestin) and oestradiol [37].
Of the two papers that included data on the effects of GnRHa, one [34] found that GnRHa produced a significant increase in total cholesterol, low-density lipoprotein (LDL) and high-density lipoprotein (HDL). There was no significant change in triglycerides. The other [40] found that GnRHa treatment produced no significant effect on lipid profile. For CSH treatment, both papers found that testosterone produced a significant decrease in HDL. Stoffers et al. [40] found that other lipid parameters were unaffected, whereas Klaver et al. (2020) [34] noted a significant increase in total cholesterol, LDL and triglycerides. They also noted that oestrogen produced a significant increase in triglycerides. These changes were compared to trends amongst the non-GD population, and were found to be similar, with no significant differences between treated NF and non-treated cisgender men, and likewise for NM and non-treated cisgender women. Tack et al. (2016) [36] found that lynestrenol produced a significant increase in LDL in the first 6 months of treatment, which then stabilised. It also produced a significant decrease in HDL. There were no significant changes to lipid profile with the addition of testosterone. Tack et al. (2017) [37] found that cyproterone acetate produced a significant decrease in total cholesterol, HDL and triglycerides. The addition of oestradiol produced no significant changes in lipid profile.
Glucose metabolism. Four papers measured parameters indicating glucose metabolism (GnRHa and CSH [34], testosterone [40], and progestins and CSH [36,37]). They measured blood glucose levels and insulin levels, and calculated the Homeostatic Model Assessment of Insulin Resistance (HOMA-IR). The two Tack et al. [36,37] papers also measured haemoglobin A1c (HbA1c) levels. Stoffers et al. [40] only measured HbA1c levels. These three papers noted no significant changes in any measure of insulin sensitivity with either PS (GnRHa or progestin) or CSH. Klaver et al. (2020) [34] followed up their population into adulthood (22y) and found that there was a significant decrease in insulin and HOMA-IR after testosterone treatment. When compared to cisgender men, it was not significantly different. They did find that HOMA-IR of NM was higher than that of cisgender women. It must be noted that only two papers [34,36] provided data, whereas the other two [37,40] only mentioned the non-significance of their findings in the text.
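For readers unfamiliar with HOMA-IR, the widely used formula multiplies fasting insulin by fasting glucose and divides by a constant (22.5 when glucose is in mmol/L, 405 when it is in mg/dL). The sketch below applies that standard formula; it is an assumption that the included papers used this variant, as the text does not say, and the input values are hypothetical.

```python
def homa_ir(glucose, insulin_uU_per_mL, glucose_unit="mmol/L"):
    """Standard HOMA-IR estimate from fasting glucose and fasting insulin.

    glucose in mmol/L: glucose * insulin / 22.5
    glucose in mg/dL:  glucose * insulin / 405
    Insulin is expected in microunits per millilitre.
    """
    if glucose_unit == "mmol/L":
        return glucose * insulin_uU_per_mL / 22.5
    if glucose_unit == "mg/dL":
        return glucose * insulin_uU_per_mL / 405
    raise ValueError("glucose_unit must be 'mmol/L' or 'mg/dL'")


# Hypothetical fasting values: 5.0 mmol/L (90 mg/dL) glucose, 9.0 uU/mL insulin.
print(homa_ir(5.0, 9.0))            # 2.0
print(homa_ir(90, 9.0, "mg/dL"))    # 2.0
```

Because the index is a product of two fasting measurements, a fall in insulin alone (as reported after testosterone treatment) lowers HOMA-IR even if glucose is unchanged.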
Haemoglobin. Three papers [36,37,40] measured the haemoglobin (Hb) of participants. The normal range is higher in men than in women. Tack et al. (2016) [36] analysed the effects of lynestrenol, finding a significant increase in mean Hb values. Two papers [36,40] analysed the effects of testosterone and found increases in mean Hb, with Stoffers et al. [40] describing six individuals with haematocrit exceeding the upper male limit. These six were on an accelerated hormone programme, and their values normalised by the end of follow-up without intervention or change of therapy. Tack et al. (2017) [37] analysed the effects of cyproterone acetate and oestradiol, finding a significant decrease in Hb with cyproterone acetate, and no further changes with oestradiol.
Liver Enzymes. Four papers analysed changes in liver enzymes. Jensen et al. [48] simply noted whether there was any change but did not report values or which enzymes were measured. Two studies [36,40] analysed the effects of testosterone and found differing increases in enzymes. Stoffers et al. [40] analysed alanine aminotransferase (ALT), aspartate aminotransferase (AST), gamma-glutamyl transferase (GGT) and alkaline phosphatase (ALP) and noted a significant increase only in ALP. Tack et al. (2016) [36] noted an increase in ALT and AST. These increases in ALT and AST remained within reference ranges for men. Stoffers et al. [40] had used GnRHa for PS, whereas Tack et al. (2016) [36] had used lynestrenol. The latter noted that with lynestrenol there was a significant increase in ALT; however, this remained within reference ranges. Additionally, one patient's ALT went above the upper limit, yet this normalised with the addition of testosterone. Two studies [37,48] analysed the effects of oestrogen. Jensen et al. [48] noted that two NMs had raised liver enzymes during oestrogen hormone therapy (although no values were reported). Tack et al. (2017) [37], who reported the effects of cyproterone acetate and oestrogen, noted no significant increases in AST or ALT. They did note that five NMs had transiently increased levels, but none reached the threshold of three times the original value, which would indicate a need to stop treatment.
Anthropometric, bone density, and physical changes. Height / weight / BMI. Seven studies included data on height and weight and / or Body Mass Index (BMI) [33,35-37,39,40,44] in relation to PS (Table 9). Findings were mixed, with no remarkable changes to anthropometry noted in any of the studies. Where weight or height increased, it was generally in line with normal development for age and affirmed gender. An exception was reported in Joseph et al. [44], where NM had a higher BMI at baseline, which increased more than that for NF. This was from a very small sample, however (n = 10).
Six papers reported data on anthropometry in relation to CSH [33,34,36,37,39,40], with two of these being explicitly on the same sample at different time points (Klaver's studies [33,34]) (see Table 10). More changes were noted for this stage of treatment, although generally in line with normal development for affirmed gender. Klaver et al. (2020) [34] noted a higher prevalence of obesity at their follow-up at 22 years old in both NM and NF participants. Tack et al. (2016) [36] also noted significant weight gain in NF (but not in NM in their 2017 paper [37]).
Other measures such as waist to hip ratio, lean body mass and body fat percentage were employed by some papers, but no remarkable findings were noted.
Bone density. Five papers contained data relating to bone health, of which one (Lee et al. [41]) focussed on a pre-treatment population (n = 63) and so was not included in this analysis. The four remaining papers all measured areal bone mineral density (BMD or aBMD; g/cm²) using dual energy X-ray absorptiometry (DXA) scanning, with three also measuring bone mineral apparent density (BMAD; g/cm³), a size-adjusted value incorporating body size measurements. These papers also all compared obtained Z-scores with reference ranges of age-matched peers of the same birth sex as the GD adolescents being studied (i.e., compared Z-scores of NF with cisgender females).
BMD in relation to GnRHa treatment was examined in three papers. Two papers reported that the absolute values for BMD did not change significantly [44,51] with GnRHa usage, whereas Stoffers et al. [40] reported that BMD significantly decreased compared to pre-treatment values. The absolute values of BMAD significantly decreased in one paper [51] but did not change significantly in the other [44]. GnRHa treatment resulted in significant decreases in both BMD and BMAD Z-scores across all three papers. Additionally, Schagen et al. [51] reported on a small cohort (n = 15) of four NMs (mean age 12.6) and 11 NFs (mean age 12.7) who were on prolonged (3 years) GnRHa treatment and found that absolute BMD values at the lumbar spine and hip remained stable, but Z-scores did decline.
CSH effects on bone health were investigated in two papers [40,51]. Both papers identified a significant increase in BMD absolute values after hormonal treatment, with Schagen et al. [51] also noting a significant increase in BMAD absolute values. Both noted significant increases in BMD Z-scores, with the Z-scores returning close to 0. However, both reported that Z-scores in their NM groups remained significantly below 0.
Tack et al. (2018) [35] analysed the effects of lynestrenol and cyproterone acetate rather than GnRHa. In the lynestrenol group, absolute aBMD values remained stable. There was a significant increase in BMD Z-scores in the total hip area, and Z-scores remained stable in the femoral neck and lumbar spine. In the cyproterone acetate group, aBMD remained stable at the femoral neck and lumbar spine but decreased significantly in the hip. There was also a significant decrease in Z-scores of total hip, femoral neck and lumbar spine.
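The pattern reported above, in which absolute BMD stays stable while Z-scores decline, follows from how a Z-score is defined: the measurement is expressed relative to the mean and standard deviation of an age-matched reference group, and that reference mean rises as peers accrue bone during puberty. The following is a minimal sketch in Python. The function name and all numbers are hypothetical illustrations and are not data from any included study.

```python
def z_score(value, ref_mean, ref_sd):
    """Standard score of a measurement against an age-matched reference group."""
    if ref_sd <= 0:
        raise ValueError("reference SD must be positive")
    return (value - ref_mean) / ref_sd


# Hypothetical illustrative numbers only (NOT taken from any included study).
patient_bmd = 0.85  # g/cm^2, assumed unchanged between the two visits

# Assumed (mean, SD) of the age-matched reference group at each visit.
reference = {
    "baseline": (0.90, 0.10),
    "follow_up": (1.00, 0.10),  # peers have accrued bone in the interval
}

z_by_visit = {
    visit: z_score(patient_bmd, ref_mean, ref_sd)
    for visit, (ref_mean, ref_sd) in reference.items()
}

for visit, z in z_by_visit.items():
    print(f"{visit}: Z = {z:.2f}")
```

With these made-up values, the unchanged measurement moves from about 0.5 SD below the reference mean to about 1.5 SD below it, purely because the reference mean has risen.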
Physical changes. Only four papers [36,37,40,48] reported other physical changes associated with treatment. The most commonly mentioned effects were amenorrhea, acne, weight gain, hair loss, headaches, fatigue, hot flushes, breast tenderness, mood swings and changes in appetite. Tack et al. (2016) [36] indicated that acne prevalence increased during the first six months of combined lynestrenol and testosterone treatment, and that metrorrhagia declined in the following six months of lynestrenol treatment.
Mental health outcomes. Five included papers covered mental health outcomes [38,43,45,49,50], which are summarised in Table 11. Three of the studies used the Children's Global Assessment Scale (CGAS [52]), a clinician rating to assess global functioning where high scores (>80) indicate good global functioning. In general, adolescents with GD showed some problems, without severe impairment, on this scale: baseline scores ranged from 55 to 74. Becker-Hebly et al. (2020) indicated that functioning improved to 'good' (scores of 81 to 85) after PS monotherapy, PS+CSH combined therapy, and surgical intervention. However, these scores were not compared to the German norm. Costa et al. (2015) reported that adolescents had some problems with global functioning at baseline, which significantly improved after 12 months of PS to a level better than that of those receiving psychological support alone and similar to that of a comparison group. de Vries et al. (2011) reported a significant improvement in global functioning after an average of just under 2 years' PS treatment.
Most measures of mental health status were self- or parent-report. Becker-Hebly et al. (2020) used the Youth / Adult Self Report (YSR / ASR [53,54]) to measure emotional and behavioural problems and observed that scores at baseline were lower than the German norm (i.e., patients had a high prevalence of problems), and that scores barely improved after any level of treatment. In their paper, de Vries et al. (2011) reported that YSR scores decreased significantly, and the percentage of adolescents who scored within the clinical range on the internalising scale decreased significantly following PS. The Body Image Scale (BIS [55]) was administered in two studies: de Vries et al. (2011) noted a higher dissatisfaction in NF than in NM for secondary sex characteristics at follow-up, although no significant changes were noted in BIS over time following PS; Kuper et al. (2020) reported that BIS scores were significantly lower at follow-up in both PS and CSH groups (no data were reported for surgical intervention). Becker-Hebly et al. (2020) utilised the Kidscreen-27 [56] to assess mental dimensions of quality of life. Baseline scores in all groups were below the German norm, with the PS group scoring within the norm at follow-up for both mental and physical quality of life. The other two groups (PS+CSH and surgery) also showed improved quality of life at follow-up, but only the physical health dimensions were within German norms (mental health scores remained lower). Anxiety symptoms, measured by SCARED [59], also decreased at follow-up. Finally, Russell et al. (2020) found no change in autism symptoms using the Social Responsiveness Scale (SRS-2 [60]) over time while undergoing PS. Gender dysphoria symptoms were measured at follow-up in one sample: de Vries et al. (2011) reported no change from baseline to follow-up while undergoing PS using the Utrecht Gender Dysphoria Scale (UGDS) [38].
What are the long-term outcomes for all (treated or otherwise) in this population? None of the included papers featured long-term follow-up. One paper, Klaver et al. (2020) [34], included follow-up at 22 years old, with the only remarkable finding being an increase in BMI and obesity beyond the cisgender population norm.

Quality assessment
The CCAT quality ratings ranged from 71% to 95%, with a mean of 82%. All papers achieved an overall rating of 4 (good, n = 8) or 5 (very good, n = 11), with strengths and weaknesses within certain discrete categories. Papers tended to score higher in Introduction, Preliminaries and Data Collection. One area in which many papers scored lower was consent: because many studies were retrospective in nature, some centres waived the requirement for researchers to seek consent, and some papers simply did not mention a consent process at all. See Table 12 for full data.

Discussion
This systematic review synthesises research evidence regarding the treatment type, age at treatment, and outcomes for adolescents presenting for assessment for gender dysphoria (GD). We identified 19 papers showing that most centres publishing data provided treatment according to the WPATH guidelines relevant at the time (v7) [7]. Young people started PS, usually GnRHa, at a mean age of 14.5 years, and CSH at a mean age of 16.2 years, although there were very wide ranges around these central points, with children as young as 8 years old starting PS and 13 years old starting CSH. Surgical intervention was uncommon in this sample: 25 participants underwent mastectomy and one underwent vaginoplasty (from two papers and one paper, respectively), with the lower age range at 15 years old.
Most of the included papers covered a range of exploratory monitoring measures, reflecting the novelty of this field of study and a desire to ensure the safety of endocrine interventions with adolescents at a sensitive time in development. These measures mostly generated unremarkable findings: although some changes were observed through PS monotherapy, these tended to resolve once CSH was introduced, and physical development continued according to affirmed gender. Notable findings were in relation to BMD and obesity: BMD appears to decrease with PS and recover with CSH, although findings are heterogeneous; one paper including longer-term follow-up (to age 22 years) showed increased obesity prevalence in both NM and NF. Mental health was measured in a small number of papers, with indications of improvement over time in treatment. Where GD symptoms were measured there was no improvement at follow-up, but these data were from samples undergoing PS monotherapy, and so GD symptoms would reasonably not be expected to improve significantly at this stage (i.e., prior to developing secondary characteristics of identified gender).
A theme common to all three papers in this review series is the clear need for prospective research on large samples from broader populations and with long-term follow-up in order to fully understand the implications of intervention during adolescence. We originally set out to study the phenomenon of adolescent- or rapid-onset GD (AOGD) and found an absence of literature, leading to our broader search strategy. Debate continues as to whether AOGD is a genuine phenomenon: Bauer et al. (2022) [61] provided data to suggest it is not, but faced strong rebuttal from both Littman (2022) [62] and Sinai (2022) [63] in terms of the way that AOGD has been defined and clinician experience. It is clear that we simply do not know enough about the observed phenomenon referred to as AOGD, nor do we fully understand the huge increase in numbers of adolescents (and especially NF) presenting for GD intervention in recent years, nor the comorbidities and long-term outcomes.

Strengths and limitations
This review has strength in the broad search strategy and thorough hand screening process applied. There is also strength in this being part of a three-part comprehensive series. However, the limitation of this approach is that time has passed since the initial searches were conducted and new literature has been published which may change the final conclusions of this paper. We have chosen to curtail this final paper to the same end date as the preceding two in the series in the interests of consistency: these three papers should cover the same time period to be considered part of the same overarching review. We conducted a quick search according to our original strategy: this returned 2208 new records without de-duplication. Based on the screening process for the present review, we could expect this to yield about 10 further papers for inclusion. However, as pointed out in the interim report for the Cass review [24,25], good quality evidence is most definitely still lacking. The Cass review [64] is now in a position to conduct more detailed systematic reviews based on a mandate from policy makers, which is a huge step forward in developing this field of evidence and giving it the prominence it deserves. Although UK-based, the quality of this review will have implications for the field internationally. We do not think it would be fruitful to update our review in light of this developing work, although we expect our paper series to provide a useful overview whilst the Cass review is ongoing.
The broad initial search criteria led to the need for some narrowing of criteria following initial screening (but prior to full-text screening). The addition of parameters regarding type of publication, upper age of participants, and the clinical verification of GD naturally narrowed the pool of papers and therefore may have meant that papers with important findings were excluded (for example, if a paper included an upper age limit of 21 even though the majority of participants were younger than 18). We endeavoured to record all papers that only narrowly missed inclusion on the age criterion (Table 4), but literature that was excluded on the basis of type (i.e., conference proceedings and grey literature) was not included at this stage and so the potential contribution of this body of work cannot be quantified or assessed. We opted to use a quality assessment tool for studies of diverse designs (CCAT). This allowed all papers to be rated using the same system, but also involved reviewers having to make subjective ratings rather than apply a strictly quantifiable checklist. This may have led to issues with quality, such as over-statement of the significance of findings, not being sufficiently prominent.
Although we were able to include 19 papers from a range of countries in this review, just under half (8) arose from two well-established treatment centres: those in Amsterdam and London. The Amsterdam team has led the way in developing assessment and treatment protocols for GD and provides a wealth of data over a long period (since 1996 within the included papers), and the London GIDS has, until very recently, been a hub for the whole of the United Kingdom, dealing with hundreds of referrals per year. This presents the advantage of being able to observe the adolescent GD population over a long period of time, assessed using the same or similar tools, and within a relatively stable social context. It is not clear, however, what proportion of young people experiencing GD have access to these national specialist centres and how many may be accessing private facilities or self-medicating with hormones obtained via other routes: we do not know how representative these samples are. Another disadvantage is that most of the papers included in this review are likely to include data from the same samples of participants, also limiting generalisability. The overlap between samples was rarely overtly stated, and there is a risk that readers may give greater weight to collective findings than is warranted. There is a clear lack of research on GD in low- and middle-income countries in the scientific literature [65], so the impact of different service contexts in countries such as India and Thailand cannot be properly considered [66].

Conclusion
There is a lack of evidence on treatment for GD in adolescence. Although there is a growing body of literature providing data, there are limitations to its scope and quality, and prospective studies with long-term follow-up from a range of centres internationally are required. This review series has highlighted a lack of quality evidence in relation to adolescent GD in general: epidemiology, comorbidity, and treatment impact are difficult to assess robustly. Without an improvement in the scientific field, clinicians, parents, and young people are left ill-equipped to make safe and appropriate decisions.